{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
      ],
      "metadata": {
        "id": "KA0rmEN-XNsd"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "F_jICWZXSVA8"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install cohere\n",
        "!pip install -qU langchain-text-splitters\n",
        "!pip install llama-index-embeddings-cohere\n",
        "!pip install llama-index-postprocessor-cohere-rerank"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "A1Q6ki-iRL4g"
      },
      "outputs": [],
      "source": [
        "import requests\n",
        "from typing import List\n",
        "\n",
        "from bs4 import BeautifulSoup\n",
        "\n",
        "import cohere\n",
        "from getpass import getpass\n",
        "from IPython.display import HTML, display\n",
        "\n",
        "from langchain_text_splitters import CharacterTextSplitter\n",
        "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
        "\n",
        "from llama_index.core import Document\n",
        "from llama_index.embeddings.cohere import CohereEmbedding\n",
        "from llama_index.postprocessor.cohere_rerank import CohereRerank\n",
        "from llama_index.core import VectorStoreIndex"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Set up Cohere client\n",
        "co_model = 'command-r'\n",
        "co_api_key = getpass(\"Enter Cohere API key: \")\n",
        "co = cohere.Client(api_key=co_api_key)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wznWW57-BAML",
        "outputId": "56de0fa7-7321-4718-8133-8740d405e78b"
      },
      "execution_count": null,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter Cohere API key: ··········\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "# Introduction\n",
        "\n",
        "Chunking is an essential component of any RAG-based system. This cookbook demonstrates how different chunking strategies affect the results of LLM-generated output. Multiple considerations come into play when designing a chunking strategy, so we begin by providing a framework for these strategies and then jump into a practical example. We focus our example on call transcripts, which pose a unique challenge because of their rich content and the frequent changes of speaker throughout the text.\n",
        "\n",
        "# Table of contents\n",
        "\n",
        "1. [Chunking Strategies Framework](#framework)\n",
        "2. [Getting started](#getting-started)\n",
        "3. [Example 1: Chunking using content-independent strategies](#example-1)\n",
        "4. [Example 2: Chunking using content-dependent strategies](#example-2)\n",
        "5. [Discussion](#discussion)"
      ],
      "metadata": {
        "id": "pIJH7u4dL7mZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a name=\"framework\"></a>\n",
        "# Chunking Strategies Framework\n",
        "\n",
        "## Document splitting\n",
        "\n",
        "By document splitting, we mean deciding on the conditions under which we will break the text. At this stage, we should ask, *\"Are there any parts of consecutive text we want to ensure we do not break?\"*. If the answer is \"no\", then content-independent splitting strategies are helpful. On the other hand, in scenarios like transcripts or meeting notes, we would probably like to keep each speaker's content together, which might require us to deploy content-dependent strategies.\n",
        "\n",
        "### Content-independent splitting strategies\n",
        "\n",
        "We split the document based on some content-independent condition; among the most popular ones are:\n",
        "- splitting by the number of characters,\n",
        "- splitting by sentence,\n",
        "- splitting by a given character, for example, `\\n` for paragraphs.\n",
        "\n",
        "The advantage of this approach is that we do not need to make any assumptions about the text. However, some considerations remain, such as whether we want to preserve semantic structure like sentences or paragraphs. Sentence splitting is better suited if we are looking for small chunks to ensure accuracy. Conversely, paragraphs preserve more context and might be more useful in open-ended questions.\n",
        "\n",
        "### Content-dependent splitting strategies\n",
        "\n",
        "On the other hand, there are scenarios in which we care about preserving some text structure. Then, we develop custom splitting strategies based on the document's content. A prime example is call transcripts. In such scenarios, we aim to ensure that one person's speech is fully contained within a chunk.\n",
        "\n",
        "## Creating chunks from the document splits\n",
        "\n",
        "After the document is split, we need to decide on the desired **size** of our chunks (the split only defines how we break the document, but we can create bigger chunks from multiple splits).\n",
        "\n",
        "Smaller chunks support more accurate retrieval. However, they might lack context. On the other hand, larger chunks offer more context, but they reduce the effectiveness of the retrieval. It is important to experiment with different settings to find the optimal balance.\n",
        "\n",
        "## Overlapping chunks\n",
        "\n",
        "Overlapping chunks is a useful technique to have in the toolbox. Especially when we employ content-independent splitting strategies, it helps us mitigate some of the pitfalls of breaking the document without fully understanding the text. Overlapping guarantees that there is always some buffer between the chunks, and even if an important piece of information might be split in the original splitting strategy, it is more probable that the full information will be captured in the next chunk. The disadvantage of this method is that it creates redundancy."
      ],
      "metadata": {
        "id": "Yuse4PR_9saD"
      }
    },
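    {
      "cell_type": "markdown",
      "source": [
        "To make the chunk-size and overlap ideas concrete, below is a minimal, self-contained sketch of a character-level sliding-window chunker. This is illustrative only: the `sliding_window_chunks` helper is not part of any library, and the examples later in this cookbook use LangChain's splitters instead. With `overlap=0` it reduces to plain fixed-size splitting."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def sliding_window_chunks(text, chunk_size, overlap=0):\n",
        "  # Toy chunker: fixed-size windows that advance by (chunk_size - overlap)\n",
        "  step = chunk_size - overlap\n",
        "  assert step > 0, 'overlap must be smaller than chunk_size'\n",
        "  return [text[i:i + chunk_size] for i in range(0, len(text), step)]\n",
        "\n",
        "sample = 'abcdefghij'\n",
        "print(sliding_window_chunks(sample, chunk_size=4))             # ['abcd', 'efgh', 'ij']\n",
        "print(sliding_window_chunks(sample, chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']"
      ]
    },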
    {
      "cell_type": "markdown",
      "source": [
        "<a name=\"getting-started\"></a>\n",
        "# Getting started\n",
        "\n",
        "Designing a robust chunking strategy is as much a science as an art. There are no straightforward answers; the most effective strategies often emerge through experimentation. Therefore, let's dive straight into an example to illustrate this concept."
      ],
      "metadata": {
        "id": "eKdDiizv_QtR"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Utils"
      ],
      "metadata": {
        "id": "1ivVb-oMCAaz"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "id": "Ky0792b5SmqP",
        "outputId": "a74aae04-48c4-420a-b500-a513f51a6731"
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        }
      ],
      "source": [
        "def set_css():\n",
        "  display(HTML('''\n",
        "  <style>\n",
        "    pre {\n",
        "        white-space: pre-wrap;\n",
        "    }\n",
        "  </style>\n",
        "  '''))\n",
        "get_ipython().events.register('pre_run_cell', set_css)\n",
        "\n",
        "set_css()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "id": "8UjX34fDWfSS",
        "outputId": "44641063-7c3c-4162-81f0-ac983481af1a"
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        }
      ],
      "source": [
        "def insert_citations(text: str, citations: List[dict]):\n",
        "    \"\"\"\n",
        "    A helper function to pretty print citations.\n",
        "    \"\"\"\n",
        "    offset = 0\n",
        "    # Process citations in the order they were provided\n",
        "    for citation in citations:\n",
        "        # Adjust start/end with offset\n",
        "        start, end = citation['start'] + offset, citation['end'] + offset\n",
        "        placeholder = \"[\" + \", \".join(doc[4:] for doc in citation[\"document_ids\"]) + \"]\"\n",
        "        # ^ doc[4:] removes the 'doc_' prefix, and leaves the quoted document\n",
        "        modification = f'{text[start:end]} {placeholder}'\n",
        "        # Replace the cited text with its bolded version + placeholder\n",
        "        text = text[:start] + modification + text[end:]\n",
        "        # Update the offset for subsequent replacements\n",
        "        offset += len(modification) - (end - start)\n",
        "\n",
        "    return text\n",
        "\n",
        "def build_retreiver(documents, top_n=5):\n",
        "  # Create the embedding model\n",
        "  embed_model = CohereEmbedding(\n",
        "      cohere_api_key=co_api_key,\n",
        "      model_name=\"embed-english-v3.0\",\n",
        "      input_type=\"search_query\",\n",
        "  )\n",
        "\n",
        "  # Build the vector index from the documents\n",
        "  index = VectorStoreIndex.from_documents(\n",
        "      documents,\n",
        "      embed_model=embed_model\n",
        "  )\n",
        "\n",
        "  # Create a cohere reranker\n",
        "  cohere_rerank = CohereRerank(api_key=co_api_key)\n",
        "\n",
        "  # Create the retriever\n",
        "  retriever = index.as_retriever(node_postprocessors=[cohere_rerank], similarity_top_k=top_n)\n",
        "  return retriever"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Load the data\n",
        "\n",
        "In this example, we will work with Tesla's Q4 2023 earnings call transcript."
      ],
      "metadata": {
        "id": "AUXfWr13CyFE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Download the transcript of Tesla's Q4 2023 earnings call\n",
        "url_path = 'https://www.fool.com/earnings/call-transcripts/2024/01/24/tesla-tsla-q4-2023-earnings-call-transcript/'\n",
        "response = requests.get(url_path)\n",
        "soup = BeautifulSoup(response.content, 'html.parser')\n",
        "\n",
        "target_divs = soup.find(\"div\", {\"class\": \"article-body\"}).find_all(\"p\")[2:]\n",
        "print('Length of the script: ', len(target_divs))\n",
        "\n",
        "print()\n",
        "print('Example of processed text:')\n",
        "text = '\\n\\n'.join([div.get_text() for div in target_divs])\n",
        "print(text[:500])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 180
        },
        "id": "NeSVXvO_e1DH",
        "outputId": "5042cbef-d3e7-4a78-b921-aa269e60570e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Length of the script:  385\n",
            "\n",
            "Example of processed text:\n",
            "Martin Viecha\n",
            "\n",
            "Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.\n",
            "\n",
            "During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and \n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a name=\"example-1\"></a>\n",
        "# Example 1: Chunking using content-independent strategies\n",
        "\n",
        "Let's begin with a simple content-independent strategy. We aim to answer the question, `Who mentions Jonathan Nolan?`. We chose this question because it is easily verifiable and it requires identifying the speaker. The answer can be found in the downloaded transcript; here is the relevant passage:\n",
        "\n",
        "\n",
        "```\n",
        "Elon Musk -- Chief Executive Officer and Product Architect\n",
        "\n",
        "Yeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\n",
        "```"
      ],
      "metadata": {
        "id": "UhlsdHPmhoMT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define the question\n",
        "question = \"Who mentions Jonathan Nolan?\""
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "id": "P4xy_imcEj10",
        "outputId": "a2904832-788a-41d8-ae10-c9049677547c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this case, we are more concerned about accuracy than a verbose answer, so we **focus on keeping the chunks small**. To ensure that the desired size is not exceeded, the text is split recursively on an ordered list of separators, in our case `[\"\\n\\n\", \"\\n\", \" \", \"\"]`: the splitter first tries to break on paragraph breaks, then line breaks, then spaces, and finally between individual characters.\n",
        "\n",
        "We employ the `RecursiveCharacterTextSplitter` from [LangChain](https://python.langchain.com/docs/get_started/introduction) for this task."
      ],
      "metadata": {
        "id": "cy8IXqA4H2ET"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define the chunking function\n",
        "def get_chunks(text, chunk_size, chunk_overlap):\n",
        "  text_splitter = RecursiveCharacterTextSplitter(\n",
        "    chunk_size=chunk_size,\n",
        "    chunk_overlap=chunk_overlap,\n",
        "    length_function=len,\n",
        "    is_separator_regex=False,\n",
        "  )\n",
        "\n",
        "  documents = text_splitter.create_documents([text])\n",
        "  documents = [Document(text=doc.page_content) for doc in documents]\n",
        "\n",
        "  return documents"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "id": "xy69i0i7xIy1",
        "outputId": "08c4a509-bab3-4f9b-9c98-cb2ad71ee59b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Experiment 1 - no overlap\n",
        "In our first experiment, we define the chunk size as 500 and allow **no overlap between consecutive chunks**.\n",
        "\n",
        "Subsequently, we implement the standard RAG pipeline. We feed the chunks into a retriever, select the `top_n` chunks most pertinent to the query, and supply them as context to the generation model. Throughout this pipeline, we leverage [Cohere's endpoints](https://docs.cohere.com/reference/about), specifically `co.embed`, `co.rerank`, and finally, `co.chat`."
      ],
      "metadata": {
        "id": "4YXsoVMMGCnC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "chunk_size = 500\n",
        "chunk_overlap = 0\n",
        "documents = get_chunks(text, chunk_size, chunk_overlap)\n",
        "retriever = build_retreiver(documents)\n",
        "\n",
        "source_nodes = retriever.retrieve(question)\n",
        "print('Number of documents: ', len(source_nodes))\n",
        "source_nodes = [{\"text\": ni.get_content()} for ni in source_nodes]\n",
        "\n",
        "response = co.chat(\n",
        "  message=question,\n",
        "  documents=source_nodes,\n",
        "  model=co_model\n",
        ")\n",
        "print(response.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 73
        },
        "id": "HAG0ybf8xbcB",
        "outputId": "6a5f2b54-f4aa-4674-8684-7601d47bcd10"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of documents:  5\n",
            "An unknown speaker mentions Jonathan Nolan in a conversation about the creators of Westworld. They mention that Jonathan Nolan and Lisa Joy Nolan are friends of theirs, and that they have invited them to visit the lab.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "A notable feature of [`co.chat`](https://docs.cohere.com/reference/chat) is its ability to ground the model's answer within the context. This means we can identify which chunks were used to generate the answer. Below, we show the previous output of the model together with the citation reference, where `[num]` represents the index of the chunk."
      ],
      "metadata": {
        "id": "zbXfrN125I-i"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(insert_citations(response.text, response.citations))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "id": "pyLy-OwBGcut",
        "outputId": "0e4a63b0-2cf2-4021-8cb7-4292a7813644"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "An unknown speaker [0] mentions Jonathan Nolan in a conversation about the creators of Westworld. [0] They mention that Jonathan Nolan and Lisa Joy Nolan [0] are friends [0] of theirs, and that they have invited them to visit the lab. [0]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Indeed, by printing the cited chunk, we can validate that the text was divided in a way that prevented the generation model from providing the correct response. Notably, the speaker's name is not included in the context, which is why the model refers to an `unknown speaker`."
      ],
      "metadata": {
        "id": "X07wAmqLG3Mf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(source_nodes[0])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "id": "StbIRRQWGei3",
        "outputId": "856b8735-661a-4bdb-e859-67082a56f6af"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'text': \"Yeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\\n\\nYeah.\\n\\nUnknown speaker\\n\\nWe're not entering Westworld anytime soon.\"}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Experiment 2 - allow overlap\n",
        "In the previous experiment, we discovered that the chunks were generated in a way that made it impossible to generate the correct answer. The name of the speaker was not included in the relevant chunk.\n",
        "\n",
        "Therefore, this time to mitigate this issue, we **allow for overlap between consecutive chunks**."
      ],
      "metadata": {
        "id": "Iu91q_IhHZoh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "chunk_size = 500\n",
        "chunk_overlap = 100\n",
        "documents = get_chunks(text, chunk_size, chunk_overlap)\n",
        "retriever = build_retreiver(documents)\n",
        "\n",
        "source_nodes = retriever.retrieve(question)\n",
        "print('Number of documents: ', len(source_nodes))\n",
        "source_nodes = [{\"text\": ni.get_content()} for ni in source_nodes]\n",
        "\n",
        "response = co.chat(\n",
        "  message=question,\n",
        "  documents=source_nodes,\n",
        "  model=co_model\n",
        ")\n",
        "print(response.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 73
        },
        "id": "Za9HW8z_xog_",
        "outputId": "f3722475-44f2-40b1-c3d7-07cf603b4994"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of documents:  5\n",
            "Elon Musk mentions Jonathan Nolan. Musk is the CEO and Product Architect of the lab that resembles the set of Westworld, a show created by Jonathan Nolan and Lisa Joy Nolan.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Again, we can print the text along with the citations."
      ],
      "metadata": {
        "id": "G1Ia5b-q5UKg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(insert_citations(response.text, response.citations))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "id": "yluOCX4I9WcV",
        "outputId": "c5bb53b7-8029-4f4c-b48c-fc2d5f4df466"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Elon Musk [0] mentions Jonathan Nolan. Musk is the CEO and Product Architect [0] of the lab [0] that resembles the set of Westworld [0], a show created by Jonathan Nolan [0] and Lisa Joy Nolan. [0]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's also investigate the chunks that were used as context to answer the query."
      ],
      "metadata": {
        "id": "pMleTdDCH22o"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "source_nodes[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 106
        },
        "id": "WdQP7ghK4GLx",
        "outputId": "ab3bd918-5c7b-4411-da3a-654b87b8fa0c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'text': \"Yeah, not the best reference.\\n\\nElon Musk -- Chief Executive Officer and Product Architect\\n\\nYeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\\n\\nYeah.\"}"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As we can see, by allowing overlap we managed to get the correct answer to our question."
      ],
      "metadata": {
        "id": "ppL1npzNLSpQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a name=\"example-2\"></a>\n",
        "# Example 2: Chunking using content-dependent strategies\n",
        "\n",
        "In the previous experiment, we provided an example of how the presence or absence of overlap can affect a model's performance, particularly in documents such as call transcripts where speakers change frequently. Ensuring that each chunk contains all relevant information is crucial. While we managed to retrieve the correct information by introducing overlap into the chunking strategy, this might still not be the optimal approach for transcripts with longer speaker turns.\n",
        "\n",
        "Therefore, in this experiment, we will adopt a content-dependent strategy.\n",
        "\n",
        "Our proposed approach entails segmenting the text whenever a new speaker begins speaking, which requires preprocessing the text accordingly."
      ],
      "metadata": {
        "id": "1IaTId1GzwMQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Preprocess the text\n",
        "\n",
        "Firstly, let's observe that in the HTML text, each time the speaker changes, their name is enclosed within `<p><strong>Name</strong></p>` tags, denoting the speaker's name in bold letters.\n",
        "\n",
        "To facilitate our text chunking process, we'll use this observation and prepend a unique character sequence, `###`, to each speaker's name, which we'll then use as a marker for splitting the text."
      ],
      "metadata": {
        "id": "RqRTHKg204yV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print('HTML text')\n",
        "print(target_divs[:3])\n",
        "print('-------------------\\n')\n",
        "\n",
        "text_custom = []\n",
        "for div in target_divs:\n",
        "  if div.get_text() is None:\n",
        "    continue\n",
        "  if str(div).startswith('<p><strong>'):\n",
        "    text_custom.append(f'### {div.get_text()}')\n",
        "  else:\n",
        "    text_custom.append(div.get_text())\n",
        "\n",
        "text_custom = '\\n'.join(text_custom)\n",
        "print(text_custom[:500])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 162
        },
        "id": "nWG44cFCzvwI",
        "outputId": "fc03be90-81d8-4992-9784-aac124ff33ed"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "HTML text\n",
            "[<p><strong>Martin Viecha</strong></p>, <p>Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&amp;A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.</p>, <p>During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and expectations as of today. Actual events or results could differ materially due to a number of risks and uncertainties, including those mentioned in our most recent filings with the SEC. [Operator instructions] But before we jump into Q&amp;A, Elon has some opening remarks.</p>]\n",
            "-------------------\n",
            "\n",
            "### Martin Viecha\n",
            "Good afternoon, everyone, and welcome to Tesla's fourth-quarter 2023 Q&A webcast. My name is Martin Viecha, VP of investor relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q4 results were announced at about 3 p.m. Central Time in the update that we published at the same link as this webcast.\n",
            "During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions an\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this approach, we split the text on our custom separator, `###`, using `CharacterTextSplitter` from [LangChain](https://python.langchain.com/docs/get_started/introduction). Since we aim to keep each speaker's remarks intact, and most of them exceed 500 characters, we'll increase the chunk size to 1000. Note that the splitter only merges separator-delimited segments and never breaks one apart, so speeches longer than 1000 characters are still kept whole."
      ],
      "metadata": {
        "id": "6CU0gC4y7Rvv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "separator = \"###\"\n",
        "chunk_size = 1000\n",
        "chunk_overlap = 0\n",
        "\n",
        "text_splitter = CharacterTextSplitter(\n",
        "    separator = separator,\n",
        "    chunk_size=chunk_size,\n",
        "    chunk_overlap=chunk_overlap,\n",
        "    length_function=len,\n",
        "    is_separator_regex=False,\n",
        ")\n",
        "\n",
        "documents = text_splitter.create_documents([text_custom])\n",
        "documents = [Document(text=doc.page_content) for doc in documents]\n",
        "\n",
        "retriever = build_retreiver(documents)\n",
        "\n",
        "source_nodes = retriever.retrieve(question)\n",
        "print('Number of documents: ', len(source_nodes))\n",
        "source_nodes = [{\"text\": ni.get_content()} for ni in source_nodes]\n",
        "\n",
        "response = co.chat(\n",
        "  message=question,\n",
        "  documents=source_nodes,\n",
        "  model=co_model\n",
        ")\n",
        "print(response.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 231
        },
        "id": "ZFvigf6Szzev",
        "outputId": "99a9e061-7cad-4308-e809-1cd38ed443e1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "WARNING:langchain_text_splitters.base:Created a chunk of size 5946, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 4092, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1782, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1392, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 2046, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1152, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1304, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1295, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 2090, which is longer than the specified 1000\n",
            "WARNING:langchain_text_splitters.base:Created a chunk of size 1251, which is longer than the specified 1000\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of documents:  5\n",
            "Elon Musk mentions Jonathan Nolan. Musk is friends with the creators of Westworld, Jonathan Nolan and Lisa Joy Nolan.\n"
          ]
        }
      ]
    },
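    {
      "cell_type": "markdown",
      "source": [
        "The warnings above are expected: `CharacterTextSplitter` only merges separator-delimited segments into chunks and never breaks a single segment apart, so any speech longer than `chunk_size` is emitted whole as an oversized chunk. A minimal sketch of that greedy merge (a simplification, not LangChain's actual implementation, using a hypothetical `merge_segments` helper):"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "def merge_segments(segments, chunk_size):\n",
        "    # Greedily pack consecutive segments into chunks of at most chunk_size;\n",
        "    # a single segment longer than chunk_size is kept whole (oversized).\n",
        "    chunks, current, length = [], [], 0\n",
        "    for seg in segments:\n",
        "        if current and length + len(seg) > chunk_size:\n",
        "            chunks.append('\\n'.join(current))\n",
        "            current, length = [], 0\n",
        "        current.append(seg)\n",
        "        length += len(seg)\n",
        "    if current:\n",
        "        chunks.append('\\n'.join(current))\n",
        "    return chunks\n",
        "\n",
        "demo = merge_segments(['a' * 30, 'b' * 30, 'c' * 1500], chunk_size=1000)\n",
        "print([len(c) for c in demo])  # -> [61, 1500]: the 1,500-character segment stays whole"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },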
    {
      "cell_type": "markdown",
      "source": [
        "Below we validate the answer using citations."
      ],
      "metadata": {
        "id": "TH62W6ek7Iom"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(insert_citations(response.text, response.citations))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        },
        "id": "1wPQZRMr1aX2",
        "outputId": "939f6b3a-d969-487e-83d9-f64074a538e6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Elon Musk [0] mentions Jonathan Nolan. [0] Musk is friends [0] with the creators of Westworld [0], Jonathan Nolan [0] and Lisa Joy Nolan. [0]\n"
          ]
        }
      ]
    },
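    {
      "cell_type": "markdown",
      "source": [
        "The `insert_citations` helper used above (defined earlier in this notebook) maps each citation's character span back onto the response text. The core idea can be sketched with a hypothetical `add_markers` function over `(start, end, doc_indices)` tuples, a simplified stand-in for the citation objects returned by the API:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "def add_markers(text, citations):\n",
        "    # citations: (start, end, doc_indices) character spans over `text`.\n",
        "    # Insert markers right-to-left so earlier offsets remain valid.\n",
        "    for start, end, docs in sorted(citations, key=lambda c: c[1], reverse=True):\n",
        "        marker = ' [' + ','.join(str(d) for d in docs) + ']'\n",
        "        text = text[:end] + marker + text[end:]\n",
        "    return text\n",
        "\n",
        "answer = 'Elon Musk mentions Jonathan Nolan.'\n",
        "print(add_markers(answer, [(0, 9, [0]), (19, 33, [0])]))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },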
    {
      "cell_type": "code",
      "source": [
        "source_nodes[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 142
        },
        "id": "D9vm42za2uhE",
        "outputId": "eeaa7e51-19b7-4ac5-a08a-5e9f603f40fd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "\n",
              "  <style>/\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'text': \"Elon Musk -- Chief Executive Officer and Product Architect\\nYeah. The creators of Westworld, Jonathan Nolan, Lisa Joy Nolan, are friends -- are all friends of mine, actually. And I invited them to come see the lab and, like, well, come see it, hopefully soon. It's pretty well -- especially the sort of subsystem test stands where you've just got like one leg on a test stand just doing repetitive exercises and one arm on a test stand pretty well.\\nYeah.\\n### Unknown speaker\\nWe're not entering Westworld anytime soon.\\n### Elon Musk -- Chief Executive Officer and Product Architect\\nRight, right. Yeah. I take -- take safety very very seriously.\\n### Martin Viecha\\nThank you. The next question from Norman is: How many Cybertruck orders are in the queue? And when do you anticipate to be able to fulfill existing orders?\"}"
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "<a name=\"discussion\"></a>\n",
        "# Discussion\n",
        "\n",
        "This example highlights some of the concerns that arise when implementing chunking strategies. Chunking is an area of ongoing research, and surveys have been published for domain-specific applications. For example, this [paper](https://arxiv.org/pdf/2402.05131.pdf) examines different chunking strategies in finance."
      ],
      "metadata": {
        "id": "uhTrS95m9x-7"
      }
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [
        "1ivVb-oMCAaz"
      ]
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}